4 research outputs found

    Artificial Vocal Learning guided by Phoneme Recognition and Visual Information

    This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the primary measure of learning success. Thereby, a novel approach for artificial vocal learning is presented that utilizes deep neural network-based phoneme recognition in order to calculate the speech acquisition objective function. This function guides a learning framework that involves the state-of-the-art articulatory speech synthesizer VocalTractLab as the motor-to-acoustic forward model. In this way, an extensive set of German phonemes, including most of the consonants and all stressed vowels, was produced successfully. The synthetic phonemes were rated as highly intelligible by human listeners. Furthermore, it is shown that visual speech information, such as lip and jaw movements, can be extracted from video recordings and be incorporated into the learning framework as an additional loss component during the optimization process. It was observed that this visual loss did not increase the overall intelligibility of phonemes. Instead, the visual loss acted as a regularization mechanism that facilitated the finding of more biologically plausible solutions in the articulatory domain.

    Tone realisation for speech synthesis of Yorùbá

    PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014
    Speech technologies such as text-to-speech synthesis (TTS) and automatic speech recognition (ASR) have recently generated much interest in the developed world as a user-interface medium to smartphones [1, 2]. However, it is also recognised that these technologies may potentially have a positive impact on the lives of those in the developing world, especially in Africa, by presenting an important medium for access to information where illiteracy and a lack of infrastructure play a limiting role [3, 4, 5, 6]. While these technologies continually experience important advances that keep extending their applicability to new and under-resourced languages, one particular area in need of further development is speech synthesis of African tone languages [7, 8]. The main objective of this work is acoustic modelling and synthesis of tone for an African tone language: Yorùbá. We present an empirical investigation to establish the acoustic properties of tone in Yorùbá, and to evaluate resulting models integrated into a Hidden Markov model-based (HMM-based) TTS system. We show that in Yorùbá, which is considered a register tone language, the realisation of tone is determined not solely by pitch levels, but also by inter-syllable and intra-syllable pitch dynamics. Furthermore, our experimental results indicate that utterance-wide pitch patterns are not only a result of cumulative local pitch changes (terracing), but also contain a significant gradual declination component. Lastly, models based on inter- and intra-syllable pitch dynamics using underlying linear pitch targets are shown to be relatively efficient and perceptually preferable to the current standard approach in statistical parametric speech synthesis employing HMM pitch models based on context-dependent phones. These findings support the applicability of the proposed models in under-resourced conditions.

    Automatic speech segmentation with limited data

    Thesis (M.Ing. (Computer Engineering)), North-West University, Potchefstroom Campus, 2009
    The rapid development of corpus-based speech systems such as concatenative synthesis systems for under-resourced languages requires an efficient, consistent and accurate solution with regard to phonetic speech segmentation. Manual development of phonetically annotated corpora is a time-consuming and expensive process which suffers from challenges regarding consistency and reproducibility, while automation of this process has only been satisfactorily demonstrated on large corpora of a select few languages by employing techniques requiring extensive and specialised resources. In this work we considered the problem of phonetic segmentation in the context of developing small prototypical speech synthesis corpora for new under-resourced languages. This was done through an empirical evaluation of existing segmentation techniques on typical speech corpora in three South African languages. In this process, the performance of these techniques was characterised under different data conditions and the efficient application of these techniques was investigated in order to improve the accuracy of resulting phonetic alignments. We found that the application of baseline speaker-specific Hidden Markov Models results in relatively robust and accurate alignments even under extremely limited data conditions and demonstrated how such models can be developed and applied efficiently in this context. The result is segmentation of sufficient quality for synthesis applications, with the quality of alignments comparable to manual segmentation efforts in this context. Finally, possibilities for further automated refinement of phonetic alignments were investigated and an efficient corpus development strategy was proposed with suggestions for further work in this direction.

    Palatalisation of /s/ in Afrikaans

    This article reports on the investigation of the acoustic characteristics of the Afrikaans voiceless alveolar fricative /s/. As yet, a palatal [ʃ] for /s/ has been reported only in a limited case, namely where /s/ is followed by palatal /j/, for example in the phrase is jy ("are you"), pronounced as [ǝ-ʃǝi]. This seems to be an instance of regressive coarticulation, resulting in coalescence of basic /s/ and /j/. The present study revealed that, especially in the pronunciation of young, white Afrikaans speakers, /s/ is also palatalised progressively when preceded by /r/ in the coda cluster /rs/, and, to a lesser extent, also in other contexts where /r/ is involved, for example across syllable and word boundaries. Only a slight presence of palatalisation was detected in the production of /s/ in the speech of the white, older speakers of the present study. This finding might be indicative of a definite change in the Afrikaans consonant system. A post hoc reflection is offered here on the possible presence of /s/-fronting, especially in the speech of the younger females. Such pronunciation could very well be a prestige marker for affluent speakers of Afrikaans.